server : return HTTP 400 if prompt exceeds context length #16486
base: master
Conversation
Hmm, that's strange; we have a specific error type for this: llama.cpp/tools/server/server.cpp, lines 1268 to 1271 in 56b4795.
We also have this test case: llama.cpp/tools/server/tests/unit/test_chat_completion.py, lines 393 to 408 in 56b4795.
I'm wondering which input leads to the 200 code that you mentioned?
The issue occurs only in streaming mode. In non-streaming mode it correctly returns 400.
Force-pushed from aac559d to 1d8b16c.
I have added a new test which covers exceeding the context in streaming mode.
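For reference, the same behavior can be checked from a client. This is a minimal C++ sketch using cpp-httplib rather than the repo's Python test harness; the host, port, and oversized prompt length are all placeholder assumptions:

#include <cassert>
#include <string>
#include "httplib.h"

// Sketch only: asserts that an over-long prompt sent with "stream": true
// now fails with HTTP 400 instead of HTTP 200 + an error in the stream body.
// Host, port, and the prompt length are placeholder assumptions.
int main() {
    httplib::Client cli("127.0.0.1", 8080);
    const std::string body =
        R"({"prompt": ")" + std::string(100000, 'a') + R"(", "stream": true})";
    const auto res = cli.Post("/completion", body, "application/json");
    assert(res && res->status == 400);
    return 0;
}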
In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is confusing and inconsistent with all other inference engines, which return an HTTP 4xx error in this case. This patch fixes the problem and makes the server return HTTP 400 in such cases.
Force-pushed from 1d8b16c to d08f91a.
inputs = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, true, true);
}

const size_t n_ctx_slot = ctx_server.n_ctx / ctx_server.params_base.n_parallel;
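For example, with --ctx-size 8192 and --parallel 2, each slot gets n_ctx_slot = 8192 / 2 = 4096 tokens, so a prompt of 5000 tokens is rejected even though it would fit in the total context window.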
I'm thinking maybe this check can better be done inside launch_slot_with_task? There you will have access to slot.n_ctx.
Unfortunately there is no way to return a non-200 status code once you call res.set_chunked_content_provider(...). That's why I am doing the check before that.
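A minimal sketch of the constraint, assuming a cpp-httplib handler with placeholder token counts (not the server's actual handler): the status line is flushed once the chunked provider starts, so the rejection has to happen before set_chunked_content_provider is called.

#include <string>
#include "httplib.h"

// Minimal cpp-httplib sketch. n_prompt_tokens and n_ctx_slot stand in for
// the values computed in the real server; the endpoint is illustrative.
int main() {
    httplib::Server svr;
    svr.Post("/completion", [](const httplib::Request &, httplib::Response & res) {
        const size_t n_prompt_tokens = 5000; // placeholder
        const size_t n_ctx_slot      = 4096; // placeholder
        if (n_prompt_tokens >= n_ctx_slot) {
            res.status = 400; // still possible: nothing has been sent yet
            res.set_content(R"({"error":"exceeds context size"})", "application/json");
            return;
        }
        res.set_chunked_content_provider("text/event-stream",
            [](size_t /*offset*/, httplib::DataSink & sink) {
                // By the time this runs, the 200 status line is already on
                // the wire, so errors can only be reported inside the body.
                const std::string chunk = "data: {}\n\n";
                sink.write(chunk.data(), chunk.size());
                sink.done();
                return true;
            });
    });
    svr.listen("127.0.0.1", 8080);
    return 0;
}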
json error_data = format_error_response("the request exceeds the available context size, try increasing it", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
error_data["n_prompt_tokens"] = n_prompt_tokens;
error_data["n_ctx"] = n_ctx_slot;
res_error(res, error_data);
If this is handled inside launch_slot_with_task, you can call send_error(slot, ".....", ERROR_TYPE_EXCEED_CONTEXT_SIZE), which should simplify things a bit.
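A rough sketch of how that alternative might look; the method signature, the task's token field, and the surrounding logic are assumptions inferred from the snippets quoted in this thread, not the actual implementation:

// Hypothetical shape of the check inside launch_slot_with_task; slot.n_ctx
// and task.tokens are assumed members based on the discussion above.
bool server_context::launch_slot_with_task(server_slot & slot, server_task && task) {
    const size_t n_prompt_tokens = task.tokens.size(); // assumed field
    if (n_prompt_tokens >= (size_t) slot.n_ctx) {
        // Route through the shared error path instead of building the
        // JSON error by hand in the HTTP handler.
        send_error(slot, "the request exceeds the available context size, try increasing it",
                   ERROR_TYPE_EXCEED_CONTEXT_SIZE);
        return false;
    }
    // ... normal slot launch continues here ...
    return true;
}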